WORLD INTELLECTUAL PROPERTY ORGANIZATION 
International Bureau 




PCT 

INTERNATIONAL APPLICATION PUBLISHED UNDER THE PATENT COOPERATION TREATY (PCT) 



(51) International Patent Classification 6 : 
C12Q 1/68, G06F 17/30, 159/00 



A1



(11) International Publication Number: WO 99/05323 

(43) International Publication Date: 4 February 1999 (04.02.99) 



(21) International Application Number: PCT/US98/15151 

(22) International Filing Date: 24 July 1998 (24.07.98) 



(30) Priority Data: 

60/053,842 
60/069,198 
60/069,436 



25 July 1997 (25.07.97) US 
11 December 1997 (11.12.97) US
11 December 1997 (11.12.97) US



(71) Applicant: AFFYMETRIX, INC. [US/US]; 3380 Central Expressway,
Santa Clara, CA 95051 (US).

(72) Inventor: BALABAN, David, J.; 37 Bret Harte Road, San 
Rafael, CA 94901 (US). 

(74) Agents: LANG, Dan, H. et al.; Townsend and Townsend 
and Crew LLP, 8th floor, Two Embarcadero Center, San 
Francisco, CA 94111-3834 (US). 



(81) Designated States: JP, European patent (AT, BE, CH, CY, DE,
DK, ES, FI, FR, GB, GR, IE, IT, LU, MC, NL, PT, SE).



Published 

With international search report. 



(54) Title: GENE EXPRESSION AND EVALUATION SYSTEM 
(57) Abstract 



An efficient and easy to use query system for a gene expression database. Using such a system, one can easily identify genes or
expressed sequence tags whose expression correlates to particular tissue types. Various tissue types may correspond to different diseases,
states of disease progression, different organs, different species, etc. Researchers may now use large scale gene expression databases to full
advantage.



FOR THE PURPOSES OF INFORMATION ONLY 



Codes used to identify States party to the PCT on the front pages of pamphlets publishing international applications under the PCT. 



AL 


Albania 


ES 


Spain 


AM 


Armenia 


FI 


Finland 


AT 


Austria 


FR 


France 


AU 


Australia 


GA 


Gabon 


AZ 


Azerbaijan 


GB 


United Kingdom


BA 


Bosnia and Herzegovina 


GE 


Georgia 


BB 


Barbados 


GH 


Ghana 


BE 


Belgium 


GN 


Guinea 


BF 


Burkina Faso 


GR 


Greece 


BG 


Bulgaria 


HU 


Hungary 


BJ 


Benin 


IE 


Ireland 


BR 


Brazil 


IL 


Israel 


BY 


Belarus 


IS 


Iceland 


CA 


Canada 


IT 


Italy 


CF 


Central African Republic 


JP 


Japan 


CG 


Congo 


KE 


Kenya 


CH 


Switzerland 


KG 


Kyrgyzstan 


CI 


Côte d'Ivoire


KP 


Democratic People's Republic of Korea


CM


Cameroon


CN 


China 


KR 


Republic of Korea 


CU 


Cuba 


KZ 


Kazakstan 


CZ 


Czech Republic 


LC 


Saint Lucia 


DE 


Germany 


LI 


Liechtenstein 


DK 


Denmark 


LK 


Sri Lanka 


EE 


Estonia 


LR 


Liberia 



LS Lesotho 

LT Lithuania 

LU Luxembourg 

LV Latvia 

MC Monaco 

MD Republic of Moldova 

MG Madagascar 

MK The former Yugoslav 

Republic of Macedonia 

ML Mali 

MN Mongolia 

MR Mauritania 

MW Malawi 

MX Mexico 

NE Niger 

NL Netherlands 

NO Norway 

NZ New Zealand 

PL Poland 

PT Portugal 

RO Romania 

RU Russian Federation 

SD Sudan 

SE Sweden 

SG Singapore 



SI 


Slovenia 


SK 


Slovakia 


SN 


Senegal 


SZ


Swaziland 


TD 


Chad 


TG 


Togo 


TJ 


Tajikistan 


TM 


Turkmenistan 


TR 


Turkey 


TT 


Trinidad and Tobago 


UA 


Ukraine 


UG 


Uganda 


US 


United States of America 


UZ


Uzbekistan 


VN 


Viet Nam 


YU 


Yugoslavia 


ZW 


Zimbabwe 



GENE EXPRESSION AND EVALUATION SYSTEM 

CROSS-REFERENCE TO RELATED APPLICATIONS 
The present application claims priority from U.S. Prov. App. No. 60/053,842 
filed July 25, 1997, entitled COMPREHENSIVE BIO-INFORMATICS DATABASE, from 
U.S. Prov. App. No. 60/069,198 filed on December 11, 1997, entitled COMPREHENSIVE 
DATABASE FOR BIOINFORMATICS, and from U.S. Prov. App. No. 60/069,436, entitled
GENE EXPRESSION AND EVALUATION SYSTEM, filed on December 11, 1997. The 
contents of all three provisional applications are herein incorporated by reference. 

The subject matter of the present application is related to the subject matter of 
the following three co-assigned applications filed on the same day as the present application: 
METHOD AND APPARATUS FOR PROVIDING A BIOINFORMATICS DATABASE 
(Attorney Docket No. 018547-033810), METHOD AND SYSTEM FOR PROVIDING A 
POLYMORPHISM DATABASE (Attorney Docket No. 018547-033820), METHOD AND 
SYSTEM FOR PROVIDING A PROBE ARRAY CHIP DESIGN DATABASE (Attorney 
Docket No. 018547-033830). The contents of these three applications are herein incorporated 
by reference. 

BACKGROUND OF THE INVENTION 
The present invention relates to computer systems and more particularly to 
computer systems for analyzing expression levels or concentrations. 

Devices and computer systems have been developed for collecting information 
about gene expression or expressed sequence tag (EST) expression in large numbers of tissue 
samples. For example, PCT application WO92/10588, incorporated herein by reference for 
all purposes, describes techniques for sequencing or sequence checking nucleic acids and
other materials. Probes for performing these operations may be formed in arrays according to 
the methods of, for example, the pioneering techniques disclosed in U.S. Patent 
No. 5,143,854 and U.S. Patent No. 5,571,639, both incorporated herein by reference for all 
purposes. 

According to one aspect of the techniques described therein, an array of 
nucleic acid probes is fabricated at known locations on a chip or substrate. A fluorescently
labeled nucleic acid is then brought into contact with the chip and a scanner generates an
image file indicating the locations where the labeled nucleic acids bound to the chip. Based 
upon the identities of the probes at these locations, it becomes possible to extract information 
such as the monomer sequence of DNA or RNA. 

Computer-aided techniques for monitoring gene expression using such arrays 
of probes have been developed as disclosed in EP Pub. No. 0848067 and PCT publication 
No. WO 97/10365, the contents of which are herein incorporated by reference. Many disease 
states are characterized by differences in the expression levels of various genes either through 
changes in the copy number of the genetic DNA or through changes in levels of transcription 
(e.g., through control of initiation, provision of RNA precursors, RNA processing, etc.) of 
particular genes. For example, losses and gains of genetic material play an important role in 
malignant transformation and progression. Furthermore, changes in the expression 
(transcription) levels of particular genes (e.g., oncogenes or tumor suppressors), serve as 
signposts for the presence and progression of various cancers. 

Information on expression of genes or expressed sequence tags may be 
collected on a large scale in many ways, including the probe array techniques described 
above. One of the objectives in collecting this information is the identification of genes or 
ESTs whose expression is of particular importance. Researchers wish to answer questions 
such as: 1) Which genes are expressed in cells of a malignant tumor but not expressed in 
either healthy tissue or tissue treated according to a particular regime? 2) Which genes or 
ESTs are expressed in particular organs but not in others? 3) Which genes or ESTs are 
expressed in particular species but not in others?

Collecting vast amounts of expression data from large numbers of samples 
including all the tissue types mentioned above is but the first step in answering these 
questions. To derive full value from the investment made in collecting and storing expression 
data, one must be able to efficiently mine the data to find items of particular relevance. What 
is needed is an efficient and easy to use query system for a gene expression database. 

SUMMARY OF THE INVENTION 
An efficient and easy to use query system for a gene expression database is 
provided by virtue of the present invention. Using such a system, one can easily identify
genes or expressed sequence tags whose expression correlates to particular tissue types.
Various tissue types may correspond to different diseases, states of disease progression, 
different organs, different species, etc. Researchers may now use large scale gene expression 
databases to full advantage. 

According to a first aspect of the present invention, a method is provided in a 
computer system for operating a database storing information about compound concentration. 
The method includes: providing a database including concentrations of a plurality of 
compounds as measured in a plurality of samples, accepting a user query to the database to 
identify desired ones of the plurality of compounds, the user query specifying concentration 
characteristics of the desired compounds in selected ones of the plurality of samples, and 
comparing the concentration characteristics to the concentrations stored in the database to 
identify the desired compounds. 

A further understanding of the nature and advantages of the inventions herein 
may be realized by reference to the remaining portions of the specification and the attached 
drawings. 

BRIEF DESCRIPTION OF THE DRAWINGS 

Fig. 1 illustrates an example of a computer system that may be used to execute 
software embodiments of the present invention. 

Fig. 2 shows a system block diagram of a typical computer system. 

Fig. 3 is a flowchart describing steps of developing expression data according 
to one embodiment of the present invention. 

Fig. 4 is a flowchart describing steps of querying an expression database 
according to one embodiment of the present invention. 

Figs. 5A-5L depict a user interface for querying an expression database
according to one embodiment of the present invention. 

DESCRIPTION OF SPECIFIC EMBODIMENTS 
Fig. 1 illustrates an example of a computer system that may be used to execute 
software embodiments of the present invention. Fig. 1 shows a computer system 1 which
includes a monitor 3, screen 5, cabinet 7, keyboard 9, and mouse 11. Mouse 11 may have
one or more buttons such as mouse buttons 13. Cabinet 7 houses a CD-ROM drive 15 and a
hard drive (not shown) that may be utilized to store and retrieve software programs including 
computer code incorporating the present invention. Although a CD-ROM 17 is shown as the 
computer readable medium, other computer readable media including floppy disks, DRAM, 
hard drives, flash memory, tape, and the like may be utilized. Cabinet 7 also houses familiar 
computer components (not shown) such as a processor, memory, and the like. 

Fig. 2 shows a system block diagram of computer system 1 used to execute 
software embodiments of the present invention. As in Fig. 1, computer system 1 includes 
monitor 3 and keyboard 9. Computer system 1 further includes subsystems such as a central 
processor 50, system memory 52, I/O controller 54, display adapter 56, removable disk 58, 
fixed disk 60, network interface 62, and speaker 64. Removable disk 58 is representative of 
removable computer readable media like floppies, tape, CD-ROM, removable hard drive, 
flash memory, and the like. Fixed disk 60 is representative of an internal hard drive or the 
like. Other computer systems suitable for use with the present invention may include 
additional or fewer subsystems. For example, another computer system could include
more than one processor 50 (i.e., a multi-processor system) or a memory cache.

Arrows such as 66 represent the system bus architecture of computer system 1. 
However, these arrows are illustrative of any interconnection scheme serving to link the 
subsystems. For example, display adapter 56 may be connected to central processor 50 
through a local bus or the system may include a memory cache. Computer system 1 shown in 
Fig. 2 is but an example of a computer system suitable for use with the present invention. 
Other configurations of subsystems suitable for use with the present invention will be readily 
apparent to one of ordinary skill in the art. In one embodiment, the computer system is an 
IBM compatible personal computer. 

The VLSIPS™ and GeneChip™ technologies provide methods of making and 
using very large arrays of polymers, such as nucleic acids, on very small chips. See U.S. 
Patent No. 5,143,854 and PCT Patent Publication Nos. WO 90/15070 and 92/10092, each of
which is hereby incorporated by reference for all purposes. Nucleic acid probes on the chip 
are used to detect complementary nucleic acid sequences in a sample nucleic acid of interest 
(the "target" nucleic acid). 

It should be understood that the probes need not be nucleic acid probes but
may also be other polymers such as peptides. Peptide probes may be used to detect the
concentration of peptides, polypeptides, or polymers in a sample. The probes should be
carefully selected to have bonding affinity to the compound whose concentration they are to
be used to measure.

In one embodiment, the present invention provides methods of reviewing and 
analyzing information relating to the concentration of compounds in a sample as measured by 
monitoring affinity of the compounds to polymers such as polymer probes. In a particular 
application, the concentration information is generated by analysis of hybridization intensity 
files for a chip containing hybridized nucleic acid probes. The hybridization of a nucleic acid 
sample to certain probes may represent the expression level of one or more genes or expressed
sequence tags (ESTs). The expression level of a gene or EST is herein understood to be the
concentration within a sample of mRNA or protein that would result from the transcription of 
the gene or EST. 

Expression level information that is reviewed and/or analyzed by virtue of the 
present invention need not be obtained from probes but may originate from any source. If the 
expression information is collected from a probe array, the probe array need not meet any 
particular criteria for size and density. Furthermore, the present invention is not limited to 
reviewing and/or analyzing fluorescent measurements of bondings such as hybridizations but 
may be readily utilized for reviewing and/or analyzing other measurements.

Concentration of compounds other than nucleic acids may be reviewed and/or
analyzed according to one embodiment of the present invention. For example, a probe array
may include peptide probes which may be exposed to protein samples, polypeptide samples,
or peptide samples which may or may not bond to the peptide probes. By appropriate
selection of the peptide probes, one may detect the presence or absence of particular proteins,
polypeptides, or peptides which would bond to the peptide probes.

A system that designs a chip mask, synthesizes the probes on the chip, labels 
nucleic acids from a target sample, and scans the hybridized probes is set forth in U.S. Patent 
No. 5,571,639 which is hereby incorporated by reference for all purposes. However, the 
present invention may be used separately for reviewing and/or analyzing the results of other 
systems for generating expression information, or for reviewing and/or analyzing
concentrations of polymers other than nucleic acids.

The term "perfect match probe" refers to a probe that has a sequence that is
perfectly complementary to a particular target sequence. The test probe is typically perfectly 
complementary to a portion (subsequence) of the target sequence. The terms "mismatch
control" or "mismatch probe" refer to probes whose sequence is deliberately selected not to 
be perfectly complementary to a particular target sequence. For each mismatch (MM) control 
in an array there typically exists a corresponding perfect match (PM) probe that is perfectly 
complementary to the same particular target sequence. 

Among the important pieces of information obtained from the chips are the 
relative fluorescent intensities obtained from the perfect match probes and mismatch probes. 
These intensity levels are used to estimate an expression level for a gene or EST. The 
computer system used for analysis will preferably have available other details of the 
experiment including possibly the gene name, gene sequence, probe sequences, probe 
locations on the substrate, and the like. 

An expression analysis is performed for each gene for each experiment. Fig. 3
is a flowchart describing steps of estimating an expression level for a particular gene as 
measured in a particular experiment on a chip. At step 302, the computer system receives 
raw scan data of N pairs of perfect match and mismatch probes. In a preferred embodiment, 
the hybridization intensities are photon counts from a fluorescein labeled target that has 
hybridized to the probes on the substrate. For simplicity, the hybridization intensity of a 
perfect match probe will be designated "I_pm" and the hybridization intensity of a mismatch
probe will be designated "I_mm."

Hybridization intensities for a pair of probes are retrieved at step 304. The 
background signal intensity is subtracted from each of the hybridization intensities of the pair 
at step 306. Background subtraction can also be performed on all the raw scan data at the 
same time. 

At step 308, the hybridization intensities of the pair of probes are compared to 
a difference threshold (D) and a ratio threshold (R). It is determined if the difference between 
the hybridization intensities of the pair (I_pm - I_mm) is greater than or equal to the difference
threshold AND the quotient of the hybridization intensities of the pair (I_pm / I_mm) is greater
than or equal to the ratio threshold. The difference and ratio thresholds are typically user-defined
values that have been determined to produce accurate expression monitoring of a gene or
genes. In one embodiment, the difference threshold is 20 and the ratio threshold is 1.2.

If I_pm - I_mm >= D and I_pm / I_mm >= R, the value NPOS is incremented at step
310. In general, NPOS is a value that indicates the number of pairs of probes which have 
hybridization intensities indicating that the gene is likely expressed. NPOS is utilized in a 
determination of the expression of the gene. 

At step 312, it is determined if I_mm - I_pm >= D and I_mm / I_pm >= R. If these
expressions are true, the value NNEG is incremented at step 314. In general, NNEG is a 
value that indicates the number of pairs of probes which have hybridization intensities 
indicating that the gene is likely not expressed. NNEG, like NPOS, is utilized in a 
determination of the expression of the gene. 

For each pair that exhibits hybridization intensities either indicating the gene is 
expressed or not expressed, a log ratio value (LR) and intensity difference value (IDIF) are 
calculated at step 316. LR is calculated as the log of the quotient of the hybridization
intensities of the pair (I_pm / I_mm). The IDIF is calculated as the difference between the
hybridization intensities of the pair (I_pm - I_mm). If there is a next pair of hybridization
intensities at step 318, they are retrieved at step 304. 
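
The loop of steps 304 through 318 can be sketched in Python as follows. This is only an illustration under stated assumptions, not the implementation of the described embodiment: the names PairTally and tally_probe_pairs are hypothetical, a single background value is assumed for all probes, the logarithm is assumed to be base 10, and pairs with a non-positive background-subtracted intensity are skipped because the text does not say how they are handled.

```python
import math
from dataclasses import dataclass, field


@dataclass
class PairTally:
    """Running totals for one gene in one experiment (hypothetical names)."""
    npos: int = 0
    nneg: int = 0
    log_ratios: list = field(default_factory=list)
    intensity_diffs: list = field(default_factory=list)


def tally_probe_pairs(pairs, background=0.0, diff_threshold=20.0, ratio_threshold=1.2):
    """Apply steps 304-318 to a list of (raw PM, raw MM) intensity pairs."""
    tally = PairTally()
    for raw_pm, raw_mm in pairs:
        # Step 306: subtract the background signal from both intensities.
        i_pm = raw_pm - background
        i_mm = raw_mm - background
        if i_pm <= 0 or i_mm <= 0:
            # Assumption: skip pairs for which the ratio would be undefined.
            continue
        # Steps 308 and 312: difference AND ratio tests in each direction.
        positive = i_pm - i_mm >= diff_threshold and i_pm / i_mm >= ratio_threshold
        negative = i_mm - i_pm >= diff_threshold and i_mm / i_pm >= ratio_threshold
        if positive:
            tally.npos += 1  # step 310
        elif negative:
            tally.nneg += 1  # step 314
        if positive or negative:
            # Step 316: log ratio (base 10 assumed) and intensity difference.
            tally.log_ratios.append(math.log10(i_pm / i_mm))
            tally.intensity_diffs.append(i_pm - i_mm)
    return tally


# Three made-up probe pairs with a background of 10 photon counts.
example = tally_probe_pairs([(120.0, 40.0), (50.0, 100.0), (60.0, 55.0)], background=10.0)
print(example.npos, example.nneg, example.intensity_diffs)
```

With the default thresholds of 20 and 1.2 given above, the first pair counts toward NPOS, the second toward NNEG, and the third toward neither.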

For each analysis performed, certain data is stored in an expression analysis
database. There is preferably a record for each gene or EST for which the chip measures
expression. This record includes fields to hold various pieces of information. One field
stores an analysis ID to identify the analysis. A result type ID field indicates whether the
listed expression results indicate that the gene is present, marginal, absent, or unknown based
on application of a decision matrix to the values P1, P2, P3, and P4. A number_positive field
shows NPOS. A number_negative field shows NNEG. A number_used field shows the
number of probes belonging to pairs that incremented NNEG or NPOS. A number_all field
indicates N. An average log ratio field indicates the average LR for all probe pairs. A
number_positive_exceeds field indicates the value of NPOS - NNEG. A
number_negative_exceeds field indicates the value of NNEG - NPOS. An average 
differential intensity field indicates the average IDIF for the probe pairs. A 
number_in_average field indicates the number of probe pairs used in computing the average. 
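
One possible in-memory shape for such a record is sketched below. The class name ExpressionRecord and the helper build_record are hypothetical. The result type ID and number_used fields are left out because the decision matrix and the values P1 through P4 are not defined in this text, and the averages are assumed to be taken over whatever LR and IDIF values the caller supplies.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ExpressionRecord:
    """Subset of the per-gene record fields named in the text."""
    analysis_id: str
    number_positive: int
    number_negative: int
    number_all: int
    number_positive_exceeds: int
    number_negative_exceeds: int
    average_log_ratio: Optional[float]
    average_differential_intensity: Optional[float]
    number_in_average: int


def _mean(values):
    """Arithmetic mean, or None for an empty list."""
    return sum(values) / len(values) if values else None


def build_record(analysis_id, n_pairs, npos, nneg, log_ratios, intensity_diffs):
    """Fill the record from the tallies produced for one gene."""
    return ExpressionRecord(
        analysis_id=analysis_id,
        number_positive=npos,
        number_negative=nneg,
        number_all=n_pairs,
        number_positive_exceeds=npos - nneg,
        number_negative_exceeds=nneg - npos,
        average_log_ratio=_mean(log_ratios),
        average_differential_intensity=_mean(intensity_diffs),
        # Assumption: the average uses every supplied IDIF value.
        number_in_average=len(intensity_diffs),
    )


# Made-up tallies for one gene measured with 20 probe pairs.
record = build_record("analysis-1", 20, 2, 1, [0.5, 0.25, -0.25], [80.0, 40.0, -30.0])
print(record)
```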

Steps of operating a user interface to the expression database will now be 
illustrated with reference to Fig. 4. The steps of Fig. 4 may be repeated or may occur in a
different order, or one or more steps may be omitted. The discussion of the user interface
will also refer to Figs. 5A-5L which depict representative screen displays of the user 
interface. 

At step 402, the user selects files of expression analysis results for querying.
Fig. 5A illustrates an interface screen where the user may specify expression results files.
Each file represents one experiment. A table 502 lists the files that have already been
selected. A given list may be saved for later use by selecting a button 504. A previously
saved list may be deleted by selecting a button 506. A button 508 resets the list depicted in
table 502 to a previously saved version. An import button 512 imports the contents of the
files depicted in table 502 for querying. Within table 502, a file name column lists the file
names that would be imported by application of import button 512. A code column indicates
the tissue type for the expression data in each file. A replicate column indicates whether the file
is a duplicate. A chip design code column indicates the chip design used to generate the data
for the file. Various other columns (not shown) give further information about the analysis
result data.

By selecting a select files button 514, the user calls up a select files screen 516
as shown in Fig. 5B. This provides an interactive file search and selection process that does
not require typing in the file name. Before importing the file list, the user should select a
species by using a species drop-down list 518 as shown in Fig. 5C. An analysis-type
drop-down list 519 allows the user to select between a relative expression analysis and an absolute
expression analysis.

Fig. 5D shows a normalization form 520 for normalizing imported expression 
results at step 404. The software scales the average difference data generated by the analysis 
routine based on the user's selections on normalization form 520. In a chip variability area
522, the user specifies housekeeping genes with known expression levels and selects a scale
value. The user can elect to either apply or not apply this scale value. If the user elects to
apply the scale value, each gene expression level measured on a single chip is multiplied by a 
value equal to the desired scaling factor divided by the average of housekeeping expression 
levels measured on that chip. 
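
The per-chip scaling just described amounts to one multiplication per gene. The following Python sketch illustrates it, assuming a chip's expression levels are held in a dictionary keyed by gene name; the function name scale_chip and the gene names and values are made up for the example.

```python
def scale_chip(levels, housekeeping_genes, scale_value):
    """Multiply every level on one chip by scale_value / mean housekeeping level."""
    housekeeping = [levels[gene] for gene in housekeeping_genes]
    factor = scale_value / (sum(housekeeping) / len(housekeeping))
    return {gene: value * factor for gene, value in levels.items()}


# Hypothetical chip: two housekeeping genes and one gene of interest.
chip = {"HK1": 200.0, "HK2": 300.0, "GENE_X": 50.0}
scaled = scale_chip(chip, ["HK1", "HK2"], scale_value=500.0)
print(scaled)
```

Here the housekeeping average is 250, so every level on the chip is multiplied by 500 / 250 = 2.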

Also on normalization form 520, in a tissue variability area 524, the user may
select a scale value that applies to data collected from multiple chips and whether or not it is
applied. If this scale value is to be applied, each expression value measured in a chip set is
multiplied by a factor equal to the scale value divided by the average expression level
measured over all genes for the entire chip set. A transformation area 526 allows the user to
select whether negative average difference values are to be converted to positive numbers by
use of a logarithmic transform. The user can reset all the changes made on normalization
form 520 by selecting a reset button 528 or apply the selected normalizations and
transformations by selecting an apply button 530.
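
The chip-set scaling can be sketched in the same way. The example below assumes a chip set is a dictionary of chips, each mapping gene names to expression levels; scale_chip_set and all identifiers in the sample data are hypothetical. The logarithmic transform of negative values is not sketched because its exact form is not given in the text.

```python
def scale_chip_set(chip_set, scale_value):
    """Multiply every level in the chip set by scale_value / overall mean level."""
    all_values = [value for chip in chip_set.values() for value in chip.values()]
    factor = scale_value / (sum(all_values) / len(all_values))
    return {
        chip_id: {gene: value * factor for gene, value in chip.items()}
        for chip_id, chip in chip_set.items()
    }


# Hypothetical chip set of two chips with two genes each (overall mean is 20).
chip_set = {
    "chipA": {"g1": 10.0, "g2": 30.0},
    "chipB": {"g3": 20.0, "g4": 20.0},
}
scaled_set = scale_chip_set(chip_set, scale_value=100.0)
print(scaled_set)
```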

At step 406, the user filters the large set of experimental data that was 
imported, normalized, and transformed. Fig. 5E depicts a filter experiments form 532. A 
lower table 534 lists the imported experiments and genes or ESTs and the expression data
associated with each combination of experiment and gene or EST. An upper table 536 is
used to enter a query to filter the experiment data in lower table 534. Each column of upper
table 536 corresponds to a column in lower table 534. Upper table 536 is similar to a
query by example (QBE) grid as included in Microsoft Access. Predicates are entered in the
columns of upper table 536 with all the predicates in a single row treated as ANDs and those
between rows treated as ORs. The results satisfying a given query are displayed in lower
table 534 upon selection of a filter button 538. Filters may be saved, deleted, and reset by use
of appropriately labeled buttons, 540, 542, and 544. A stored filter may be loaded by use of a
drop-down list 546. Selection of an export button 547 writes the data to an Excel spreadsheet.
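
The AND-within-a-row, OR-between-rows rule of the query grid can be illustrated with a short Python sketch. It assumes each grid row is a dictionary mapping a column name to a predicate function and each data row is a dictionary of column values; the function names, column names, and sample records are hypothetical and do not reflect how the described interface stores its queries.

```python
def matches(record, query_rows):
    """True if the record satisfies every predicate of at least one grid row."""
    return any(
        all(predicate(record[column]) for column, predicate in row.items())
        for row in query_rows
    )


def apply_filter(records, query_rows):
    """Keep only the records that satisfy the query grid."""
    return [record for record in records if matches(record, query_rows)]


# Made-up experiment data.
records = [
    {"experiment": "E1", "gene": "growth factor A", "level": 5.0},
    {"experiment": "E1", "gene": "kinase B", "level": 25.0},
    {"experiment": "E2", "gene": "kinase B", "level": 3.0},
]

# Row 1: experiment is E1 AND level > 10.  Row 2 (ORed): gene contains "growth".
query_rows = [
    {"experiment": lambda v: v == "E1", "level": lambda v: v > 10},
    {"gene": lambda v: "growth" in v},
]

filtered = apply_filter(records, query_rows)
print(filtered)
```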

To facilitate further user queries, the user may specify a new field to be used 
as a pivot field for future queries at step 408. Elements of the selected field will become 
columns in the new table. Fig. 5F shows how a pivot value is selected by use of a drop-down 
list 548. The pivot value identifies the expression data that will be listed in the columns of 
lower table 534. Fig. 5G shows a pivot column drop-down list 550 that allows selection of a
particular column of lower table 534 as the pivot field. The entries of the selected column are 
shown in a left list box 552 and moved to a right list box 554 to include them as rows in the 
pivoted table. The user selects arrow keys 556 to add and delete items of right list box 554. 
To perform the pivot operation, the user selects a pivot button 558. 
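
A pivot of this kind can be sketched in Python as shown below. The sketch follows the statement that elements of the selected field become columns of the new table, and assumes one input row per combination of gene and experiment. The function name pivot, the field names, and the expression values are made up; the experiment identifiers are borrowed from the query example discussed with reference to Fig. 5H.

```python
def pivot(rows, row_key, pivot_field, value_field, selected=None):
    """Turn long-format rows into {row_key value: {pivot value: measurement}}."""
    table = {}
    for row in rows:
        column = row[pivot_field]
        if selected is not None and column not in selected:
            continue  # entry was not moved to the selected list
        table.setdefault(row[row_key], {})[column] = row[value_field]
    return table


# Made-up long-format data: one row per gene and experiment.
rows = [
    {"gene": "geneA", "experiment": "4002736D", "level": 40.0},
    {"gene": "geneA", "experiment": "4003228A", "level": 12.0},
    {"gene": "geneB", "experiment": "4002736D", "level": 15.0},
    {"gene": "geneB", "experiment": "4003228A", "level": 30.0},
]

pivoted = pivot(rows, row_key="gene", pivot_field="experiment", value_field="level")
print(pivoted)
```

After the pivot, each gene occupies one row and each selected experiment one column, which is the layout the tissue-type queries operate on.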

Fig. 5H depicts a user interface for filtering tissue types as displayed as a 
result of the pivot operation. Lower table 534 shows the result of a pivot operation as
described with reference to Figs. 5F-5G.

Upper table 536 is now used at step 410 to specify a query to filter genes using 
the results of experiments obtained from different tissue types. Again, predicates in a row are 
treated as ANDs. Predicates between rows are treated as ORs. By properly formulating a 
query, the user may answer questions such as which genes are up-regulated in normal tissue
and down-regulated in diseased tissue. The depicted Entrez definition column contains the
definition column from the public domain Entrez database. The depicted query marked 'like
"growth"' retains those records having the string "growth" as a substring in the designated
column.

One condition satisfying the depicted query is that a gene have an expression
level in experiment 4002736D greater than 10 and an expression level in experiment
4003228A greater than 10 and less than 0.6 times the expression level in experiment
4002736D. An alternate condition satisfying the query is that the expression level in
experiment 4002736D be greater than 10 and the expression level in experiment 4003228A be
greater than 10 and greater than 1.4 times the expression level in experiment 4002736D.

This query determines the genes that have a particular fold change pattern 
between experiment 4003228A and experiment 4002736D. It will filter out genes for which
there is no significant fold change between the experiments. Specifically, it finds all genes
for which the expression level of experiment 4003228A is less than 60% of the expression
level of experiment 4002736D, or for which the expression level of experiment 4003228A is
greater than 140% of the expression level of experiment 4002736D. Both experiments are 
also constrained to have expression levels greater than 10. 
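
The fold-change query can be written out as a short Python sketch. It assumes a pivoted table in which each gene maps experiment identifiers to expression levels; the function name fold_change_hits and the gene names and levels are made up, while the experiment identifiers and the thresholds 10, 0.6, and 1.4 are those of the depicted query.

```python
def fold_change_hits(pivoted, base="4002736D", other="4003228A",
                     floor=10.0, low=0.6, high=1.4):
    """Return genes above the floor in both experiments whose level in `other`
    is below low * base level or above high * base level."""
    hits = []
    for gene, levels in pivoted.items():
        base_level = levels[base]
        other_level = levels[other]
        above_floor = base_level > floor and other_level > floor
        changed = other_level < low * base_level or other_level > high * base_level
        if above_floor and changed:
            hits.append(gene)
    return hits


# Made-up expression levels for four genes.
table = {
    "geneA": {"4002736D": 100.0, "4003228A": 50.0},   # below 60% of base
    "geneB": {"4002736D": 100.0, "4003228A": 110.0},  # no significant change
    "geneC": {"4002736D": 20.0, "4003228A": 40.0},    # above 140% of base
    "geneD": {"4002736D": 100.0, "4003228A": 8.0},    # fails the floor of 10
}
hits = fold_change_hits(table)
print(hits)
```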

Filters may be saved or reset by selection of buttons 560 and 562, respectively. 
The records displayed in lower table 534 may be sorted on any column(s), and columns may
be hidden, frozen, or repositioned for better viewing. Lower table 534 may also be saved in
different formats, including a spreadsheet format such as Microsoft Excel, by clicking on an
export button 564. A saved filter may be accessed via a pull down menu 566 or deleted by
selection of a delete button 568. Additional information on any gene may be obtained by
double clicking its row. This will load an Internet browser program and open a web site such
as the Entrez web site that stores information for the gene. The browser program then
displays the entry for that gene.

At step 412, by selecting a graph button 570, the user calls up a scatter-plot
display 572 depicted in Fig. 5I. Two experiments are selected for comparison using
drop-down lists 574 and 576 for the x axis and y axis respectively. The graph is generated by
selecting a build scatter button 578. Each point on the scatter plot corresponds to a particular
gene. The point is positioned on the graph according to its measured expression level in both
experiments. By checking a box 580, the user may select to have the points color coded
according to whether the gene was present in both (2P), one (1P), or neither (0P) of the
experiments. By checking one or more of boxes 582, the user may elect to show or not show 
genes according to this categorization. 
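
The 2P, 1P, and 0P categorization can be sketched in Python as follows, assuming each gene carries an expression call of "present" or "absent" for each of the two plotted experiments. The function names, the dictionary layout, and the sample points are hypothetical.

```python
def presence_category(call_x, call_y):
    """Return "2P", "1P", or "0P" from the calls in the two plotted experiments."""
    present = sum(1 for call in (call_x, call_y) if call == "present")
    return f"{present}P"


def visible_points(points, shown_categories):
    """Keep only the genes whose category is among those checked for display."""
    return {
        gene: point
        for gene, point in points.items()
        if presence_category(point["call_x"], point["call_y"]) in shown_categories
    }


# Made-up scatter-plot points: expression level and call in each experiment.
points = {
    "geneA": {"x": 120.0, "y": 90.0, "call_x": "present", "call_y": "present"},
    "geneB": {"x": 80.0, "y": 5.0, "call_x": "present", "call_y": "absent"},
    "geneC": {"x": 3.0, "y": 4.0, "call_x": "absent", "call_y": "absent"},
}
shown = visible_points(points, {"2P", "1P"})
print(sorted(shown))
```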

By making an appropriate selection in a box 584, the user may select an 
interpretation for future mouse clicks. One choice is for the system to do nothing in response 
to a mouse click. Another choice is for the system to show gene data for a point selected by a 
mouse click. The gene data appears in a box 586 including the accession number, the gene 
name, the expression levels as measured in a variety of experiments, and an expression call
for each experiment (either absent or present). An Entrez definition name is also shown.
Double clicking on an entry will invoke an Internet browser to show the Entrez entry for the 
gene. 

The user may also select "rope" in box 584 to collect interesting points for 
comparison by surrounding them with a polygon. Lines are automatically drawn between 
each mouse click, encircling those genes to be included in a bar graph. The user may display
the bar graph by selecting a button 588. 
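
The text does not say how the system decides which points fall inside the roped polygon. One common way to make that decision is the ray-casting (even-odd) test, sketched below in Python as an assumption rather than as the method of the described embodiment. The function names and sample data are hypothetical, and the polygon is taken to be the list of mouse-click positions in plot coordinates.

```python
def point_in_polygon(x, y, polygon):
    """Even-odd ray-casting test; polygon is a list of (x, y) vertices."""
    inside = False
    count = len(polygon)
    for i in range(count):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % count]
        # Does this edge cross the horizontal line through the point?
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def roped_genes(points, polygon):
    """Return the genes whose (x, y) position lies inside the polygon."""
    return [gene for gene, (x, y) in points.items() if point_in_polygon(x, y, polygon)]


# Made-up rope (a square) and gene positions on the scatter plot.
rope = [(0.0, 0.0), (10.0, 0.0), (10.0, 10.0), (0.0, 10.0)]
positions = {"geneA": (5.0, 5.0), "geneB": (15.0, 5.0), "geneC": (2.0, 8.0)}
selected = roped_genes(positions, rope)
print(selected)
```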

At step 414, Fig. 5J depicts a bar graph 590 for the roped genes in the scatter
plot of Fig. 5I. Each grouping of bars in Fig. 5J corresponds to a gene. Each bar within a
grouping corresponds to an experiment and is color-coded according to a legend 592.
Initially only two experiments are displayed, the two experiments corresponding to the axes
of the scatter plot of Fig. 5I. However, the user may select further experiments from a box
594. Once the desired experiments are selected, the user selects a build button 596 to display
the desired bar graph. A table 598 shows the expression levels for the depicted genes.

For the display of Fig. 5J, the option "gene" is selected in a box 600. To view
individual plots of the expression level for each gene as they vary over the experiments, the
user may select option "experiment" in box 600 before selecting build button 596. This
produces a line graph 602 as shown in Fig. 5K. The experiments are arranged along the
horizontal axis in the order specified in box 594. Each gene has its own trace corresponding 
to its expression level as it varies over the experiments. A legend 604 identifies the trace for 
each gene. To change the position of an experiment along the horizontal axis, the user uses 
5 up and down arrows 606 and 608 to change its position. This feature makes it possible to 
reorder the experiments to reflect additional sequencing knowledge. For example, if the 
experiments represent a time course such as progression of a disease or treatment, they can be 
graphically ordered in time sequence. The graph then represents the change in expression 
level as a function of time for the selected gene. A slider icon 612 allows the user to scroll 
10 along the horizontal axis if line graph 602 does not fit on the screen. A maker check box 614 
shows a horizontal line across line graph 602 defining a particular expression level. This 
allows the user to easily view data points above the selected level. 
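
As an illustration only, the reordering of experiments and the marker level might be modeled as follows. The sketch assumes the expression levels are held in a mapping from gene name to per-experiment levels. The function names move_experiment, build_traces and points_above, and the sample values, are hypothetical and are not taken from the described system.

```python
from typing import Dict, List


def move_experiment(order: List[str], name: str, offset: int) -> List[str]:
    """Return a new ordering with one experiment shifted by offset positions (clamped)."""
    new_order = list(order)
    i = new_order.index(name)
    j = max(0, min(len(new_order) - 1, i + offset))
    new_order.insert(j, new_order.pop(i))
    return new_order


def build_traces(levels: Dict[str, Dict[str, float]], order: List[str]) -> Dict[str, List[float]]:
    """One trace per gene: its expression levels listed in the chosen experiment order."""
    return {gene: [by_exp[e] for e in order] for gene, by_exp in levels.items()}


def points_above(traces: Dict[str, List[float]], order: List[str], marker: float) -> Dict[str, List[str]]:
    """For each gene, the experiments whose level lies above the marker line."""
    return {
        gene: [e for e, level in zip(order, trace) if level > marker]
        for gene, trace in traces.items()
    }


# Hypothetical time course that was initially listed out of time sequence.
levels = {"geneA": {"day0": 5.0, "day3": 20.0, "day7": 80.0}}
order = ["day0", "day7", "day3"]
order = move_experiment(order, "day3", -1)  # comparable to pressing the "up" arrow once
traces = build_traces(levels, order)
print(order, traces, points_above(traces, order, 15.0))
```

Keeping the ordering separate from the stored levels means the same data can be redrawn in any sequence without modifying the underlying measurements.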

More information about a gene may be obtained by clicking on any bar in the
group. All of the information for the gene will be displayed in a separate window 610 as
shown in Fig. 5L.

In the foregoing specification, the invention has been described with reference
to specific exemplary embodiments thereof. It will, however, be evident that various
modifications and changes may be made thereunto without departing from the broader spirit
and scope of the invention as set forth in the appended claims and their full scope of
equivalents. For example, it will be understood that wherever "expression level" is referred
to, one may substitute the measured concentration of any compound. Also, wherever "gene"
is referred to, one may substitute the term "expressed sequence tag."

WHAT IS CLAIMED IS:

1. In a computer system, a method for operating a database storing
expression level information comprising:

providing a database comprising expression levels for each of a plurality of 
genes or expressed sequence tags (EST) as measured in each of a plurality of tissue types; 

accepting a user query to said database to identify desired ones of said 
plurality of genes or EST, said user query specifying expression level characteristics of said 
desired genes; and 

comparing said expression level characteristics to said expression levels stored 
in said database to identify said desired genes or EST. 

2. The method of claim 1 further comprising:
displaying information identifying said desired genes or EST. 

3. The method of claim 1 wherein said plurality of tissue types comprise 
a diseased tissue type. 

4. The method of claim 1 wherein said plurality of tissue types comprise 
a healthy tissue type. 

5. The method of claim 1 wherein said plurality of tissue types comprise 
a cancerous tissue type. 

6. The method of claim 1 wherein said plurality of tissue types comprise
a drug treated tissue type. 

7. The method of claim 1 wherein said plurality of tissue types comprise 
tissues obtained from disparate species.

8. The method of claim 1 wherein said plurality of tissue types comprise
tissues obtained from disparate organs.

9. The method of claim 1 wherein said expression level characteristics
comprise expression level ranges as measured for a particular gene in at least two of said
plurality of tissue types.

10. The method of claim 1 wherein said expression level characteristics
comprise relationships among expression levels as measured for a particular gene in at least
two of said plurality of tissue types.

11. The method of claim 1 further comprising:

accepting user input selecting two of said plurality tissue types for graphical
display;

displaying a first axis corresponding to a first one of said two tissue types;

displaying a second axis corresponding to a second one of said two tissue
types;

for a selected one of said plurality of genes or EST, displaying a mark at a
position wherein said position is selected relative to said first axis in accordance with an
expression level of said selected gene or EST measured in said first tissue type and selected
relative to said second axis in accordance with an expression level of said selected gene or
EST measured in said second tissue type.

12. The method of claim 11 further comprising:
repeating said operation of displaying a mark for a plurality of selected genes
or EST.

13. In a computer system, a method for operating a database storing
information about compound concentration comprising:

providing a database comprising concentrations of a plurality of compounds as
measured in a plurality of samples;

accepting a user query to said database to identify desired ones of said
plurality of compounds, said user query specifying concentration characteristics of said
desired compounds in selected ones of said plurality of samples; and

comparing said concentration characteristics to said concentrations stored in
said database to identify said desired compounds.

14. A computer program product for operating a database storing
expression level information comprising:

code that provides a database comprising expression levels for each of a
plurality of genes or expressed sequence tags (EST) as measured in each of a plurality of
tissue types;

code that accepts a user query to said database to identify desired ones of said
plurality of genes or EST, said user query specifying expression level characteristics of said
desired genes;

code that compares said expression level characteristics to said expression
levels stored in said database to identify said desired genes or EST; and

a computer-readable storage medium for storing the codes.

15. The product of claim 14 further comprising:
code that displays information identifying said desired genes or EST.

16. The product of claim 14 wherein said plurality of tissue types comprise
a diseased tissue type.

17. The product of claim 14 wherein said plurality of tissue types comprise
a healthy tissue type.

18. The product of claim 14 wherein said plurality of tissue types comprise
a cancerous tissue type.

19. The product of claim 14 wherein said plurality of tissue types comprise
a drug treated tissue type.

20. The product of claim 14 wherein said plurality of tissue types comprise
tissues obtained from disparate species.

21. The product of claim 14 wherein said plurality of tissue types comprise
tissues obtained from disparate organs.

22. The product of claim 14 wherein said expression level characteristics
comprise expression level ranges as measured for a particular gene in at least two of said
plurality of tissue types.

23. The product of claim 14 wherein said expression level characteristics
comprise relationships among expression levels as measured for a particular gene in at least
two of said plurality of tissue types.

24. The product of claim 14 further comprising:

code that accepts user input selecting two of said plurality tissue types for
graphical display;

code that displays a first axis corresponding to a first one of said two tissue
types;

code that displays a second axis corresponding to a second one of said two
tissue types;

code that, for a selected one of said plurality of genes or EST, displays a mark
at a position wherein said position is selected relative to said first axis in accordance with an
expression level of said selected gene or EST measured in said first tissue type and selected
relative to said second axis in accordance with an expression level of said selected gene or
EST measured in said second tissue type.

25. The product of claim 24 further comprising:
code that repeatedly applies said code that displays a mark for a plurality of
selected genes or EST.

26. A computer program product for operating a database storing
information about compound concentration comprising:

code that receives a database comprising concentrations of a plurality of
compounds as measured in a plurality of samples;

code that accepts a user query to said database to identify desired ones of said
plurality of compounds, said user query specifying concentration characteristics of said
desired compounds in selected ones of said plurality of samples; and

code that compares said concentration characteristics to said concentrations
stored in said database to identify said desired compounds.

27. A computer system comprising:

a processor; and

a memory storing code to operate said processor, said code comprising:

code that provides a database comprising expression levels for each of a
plurality of genes or expressed sequence tags (EST) as measured in each of a plurality of
tissue types;

code that accepts a user query to said database to identify desired ones of said
plurality of genes or EST, said user query specifying expression level characteristics of said
desired genes; and

code that compares said expression level characteristics to said expression
levels stored in said database to identify said desired genes or EST.